
Conversation

@kebe7jun (Contributor) commented Aug 27, 2025

Purpose

Add streaming support for non-harmony models in the Responses API.

Related issue #23225

Test Plan

Unit tests and self-tests (see results below).

Test Result

GPT-OSS Stream output
ResponseCreatedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=0, delta='User', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' wants', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' us', item_id='', output_index=0, sequence_number=6, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=0, delta=' but', item_id='', output_index=0, sequence_number=110, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' okay', item_id='', output_index=0, sequence_number=111, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta='.', item_id='', output_index=0, sequence_number=112, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=0, item_id='', output_index=1, sequence_number=113, text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=1, sequence_number=114, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=1, sequence_number=115, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=116, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='double', item_id='', logprobs=[], output_index=1, sequence_number=117, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=118, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=145, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bath', item_id='', logprobs=[], output_index=1, sequence_number=146, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=0, item_id='', logprobs=[], output_index=2, sequence_number=147, text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='', output_index=2, part=ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None), sequence_number=148, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None)], role='assistant', status='completed', type='message'), output_index=2, sequence_number=149, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=81, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=149, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=230), user=None), sequence_number=150, type='response.completed')
Qwen3 30B A3B Stream output
ResponseCreatedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=1, delta='\n', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=2, delta='Okay', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=256, delta='.\n', item_id='', output_index=0, sequence_number=259, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=257, item_id='', output_index=0, sequence_number=260, text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=0, sequence_number=261, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=262, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=263, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=1, delta='\n\n', item_id='', logprobs=[], output_index=1, sequence_number=264, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=2, delta='Double', item_id='', logprobs=[], output_index=1, sequence_number=265, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=42, delta='', item_id='', logprobs=[], output_index=1, sequence_number=305, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=43, item_id='', logprobs=[], output_index=1, sequence_number=306, text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=44, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None), sequence_number=307, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None)], role='assistant', status='completed', type='message', summary=[]), output_index=1, sequence_number=308, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=18, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=300, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=318), user=None), sequence_number=309, type='response.completed')
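Both dumps above share one invariant worth calling out: `sequence_number` starts at 0 and increases by exactly one for every event in the stream, across item boundaries. The sketch below is a hypothetical helper (not part of this PR) that checks that invariant on a list of parsed events:

```python
# Hypothetical helper, not part of the vLLM PR: verifies the invariant visible
# in both stream dumps above -- sequence_number counts 0, 1, 2, ... with no
# gaps or repeats across the whole response stream.

def check_sequence_numbers(events: list[dict]) -> bool:
    """Return True if events are numbered 0, 1, 2, ... with no gaps."""
    return all(e["sequence_number"] == i for i, e in enumerate(events))

events = [
    {"type": "response.created", "sequence_number": 0},
    {"type": "response.in_progress", "sequence_number": 1},
    {"type": "response.output_item.added", "sequence_number": 2},
    {"type": "response.completed", "sequence_number": 3},
]
print(check_sequence_numbers(events))  # True
print(check_sequence_numbers(events[::-1]))  # False
```

A check like this is a cheap unit-test assertion for any streaming handler, since clients may rely on the numbering to detect dropped events.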


@kebe7jun force-pushed the feature/responses-api-streaming branch from 8dc2da4 to b65638e on August 27, 2025 11:32
@mergify bot added the frontend label Aug 27, 2025
@kebe7jun force-pushed the feature/responses-api-streaming branch from b65638e to 3bb6902 on August 27, 2025 11:46
@mergify bot added the v1 label Aug 27, 2025
@kebe7jun marked this pull request as ready for review August 27, 2025 11:55
@kebe7jun requested a review from aarnphm as a code owner August 27, 2025 11:55
@kebe7jun force-pushed the feature/responses-api-streaming branch 2 times, most recently from 6d9fe9c to af25d9a, on August 28, 2025 01:37
@kebe7jun (Contributor, Author) commented:

@heheda12345 PTAL

@heheda12345 (Collaborator) left a comment:

Thanks for your contribution. Some small comments.

) -> AsyncGenerator[str, None]:
sequence_number = 0
current_content_index = 0 # FIXME: this number is never changed
@heheda12345 (Collaborator):

Can you fix these indexes? Reference: #23382

@kebe7jun (Contributor, Author):

fixed.
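For context on what "fixing the indexes" entails: in the Responses API, `output_index` advances per output item, `content_index` restarts within each item, and `sequence_number` ticks once per emitted event. The class below is a hypothetical sketch of that bookkeeping; the names mirror the PR's locals, but this is not the vLLM implementation:

```python
# Hypothetical index bookkeeping for a Responses API streaming handler.
# Not the vLLM code -- just an illustration of the relationship between
# the three counters the review thread discusses.

class StreamIndices:
    def __init__(self) -> None:
        self.sequence_number = -1
        self.output_index = -1
        self.content_index = -1

    def next_event(self) -> int:
        """Every emitted event consumes the next sequence number."""
        self.sequence_number += 1
        return self.sequence_number

    def new_output_item(self) -> int:
        """A new output item advances output_index and resets content parts."""
        self.output_index += 1
        self.content_index = -1
        return self.output_index

    def new_content_part(self) -> int:
        """Content parts are numbered within their enclosing item."""
        self.content_index += 1
        return self.content_index

idx = StreamIndices()
idx.new_output_item()   # reasoning item -> output_index 0
idx.new_content_part()  # its text part  -> content_index 0
idx.new_output_item()   # message item   -> output_index 1
idx.new_content_part()  # its text part  -> content_index 0 again
print(idx.output_index, idx.content_index)  # 1 0
```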

@kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77bd0aa to bc4c5ae, on September 3, 2025 03:18
@@ -864,7 +861,7 @@ async def _process_simple_streaming_events(
     created_time: int,
     _send_event: Callable[[BaseModel], str],
 ) -> AsyncGenerator[str, None]:
-    current_content_index = 0  # FIXME: this number is never changed
+    current_content_index = 0
     current_output_index = 0
     current_item_id = ""  # FIXME: this number is never changed
@heheda12345 (Collaborator):

Thanks for the quick update. Can you also update the `current_item_id`?

@kebe7jun (Contributor, Author):

Thank you for the reminder, my apologies for the oversight; fixed.

@kebe7jun force-pushed the feature/responses-api-streaming branch from bc4c5ae to cf993d1 on September 3, 2025 07:51
@heheda12345 (Collaborator) left a comment:

LGTM! Thanks for your contribution.

@heheda12345 enabled auto-merge (squash) September 3, 2025 18:11
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 3, 2025
@heheda12345 (Collaborator) commented:

@kebe7jun The v1-test-entrypoints CI failure seems to be related to this PR. Can you take a look?

 v1/entrypoints/openai/responses/test_basic.py::test_streaming - TypeError: 'AsyncStream' object is not iterable
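This `TypeError` is the classic symptom of consuming an async iterator with a synchronous `for` loop: the OpenAI client's `AsyncStream` implements `__aiter__`/`__anext__` but not `__iter__`, so it must be iterated with `async for`. A minimal self-contained reproduction with a stand-in stream class (`FakeAsyncStream` is hypothetical, used here only to mimic the behavior):

```python
# Minimal reproduction of the CI failure: an async-iterable-only object
# raises TypeError under a plain `for` loop and must be consumed with
# `async for`. FakeAsyncStream is a stand-in, not the openai class.
import asyncio

class FakeAsyncStream:
    """Async-iterable only, like openai's AsyncStream."""
    def __init__(self, events):
        self._events = events

    def __aiter__(self):
        self._it = iter(self._events)
        return self

    async def __anext__(self):
        try:
            return next(self._it)
        except StopIteration:
            raise StopAsyncIteration

async def consume(stream):
    received = []
    async for event in stream:  # correct: async for, not for
        received.append(event)
    return received

events = asyncio.run(consume(FakeAsyncStream(["created", "delta", "done"])))
print(events)  # ['created', 'delta', 'done']

# By contrast, `for event in FakeAsyncStream([...])` raises:
# TypeError: 'FakeAsyncStream' object is not iterable
```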

auto-merge was automatically disabled September 4, 2025 01:14

Head branch was pushed to by a user without write access

@kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77ef2de to 30d435e, on September 4, 2025 04:37
@kebe7jun force-pushed the feature/responses-api-streaming branch from 30d435e to 3e604da on September 4, 2025 04:38
@DarkLight1337 merged commit 8f423e5 into vllm-project:main Sep 4, 2025
39 checks passed
@kebe7jun deleted the feature/responses-api-streaming branch September 4, 2025 09:49
JasonZhu1313 pushed a commit to JasonZhu1313/vllm that referenced this pull request Sep 7, 2025